ABSTRACT

Applications such as Google Docs, Office 365, and Drop- 
box show a growing trend towards incorporating multi-user 
live collaboration functionality into web applications. These 
collaborative applications share a need to efficiently express 
shared state, and a common strategy for doing so is a shared 
log abstraction. Extensive research efforts on log abstrac- 
tions by the database, programming languages, and distributed 
systems communities have identified a variety of optimiza- 
tion techniques based on the algebraic properties of updates 
(i.e., pairwise commutativity, subsumption, and idempotence). 
Although these techniques have been applied to specific ap- 
plications and use-cases, to the best of our knowledge, no 
attempt has been made to create a general framework for 
such optimizations in the context of a non-trivial update lan- 
guage. In this paper, we introduce mutation languages, a 
low-level framework for reasoning about the algebraic properties
of state updates, or mutations. We define BarQL, a
general-purpose state-update language, and show how mutation
languages allow us to reason about the algebraic properties
of updates expressed in BarQL.



1. INTRODUCTION 

Over the past several years, many web applications have 
been released that duplicate and improve on the function- 
ality of desktop applications (e.g. Google Docs). A natu- 
ral consequence of this shift from the desktop to the web 
is that applications have become more collaborative. Fully 
featured word processors, presentation editors, spreadsheets, 
and drawing programs now exist that allow users to collabo- 
ratively edit, view, and annotate documents in "real-time." 
Although these collaborative applications are structured us- 
ing a client/server model, the core functionality of the appli- 
cation is typically built into the client. The server's primary 
role is solely to relay state updates between clients. In spite 
of this apparent structural simplicity, collaborative applica- 
tion developers must still expend substantial effort to build 
scalable and efficient infrastructures for their applications. 
To address this concern, we present the theoretical foundations
for a generalized server infrastructure for collaborative
applications: Laasie¹. Laasie's primary goal is to encode
and replicate application state through a distributed log
data structure. Clients change application state
by appending updates to the log.

The primary motivation for this design is to allow clients to 
easily recover from link failures (e.g., when the host plat- 
form changes networks or after it wakes from sleep mode) 
by maintaining a pointer to the most recent log entry that 
they have seen. The server can bring a client up to the most 
recent state by replaying all log entries that appear after the 
client's pointer. 

Crucially, updates are expressed in the log in terms of intent
rather than effect. Below, we introduce and discuss BarQL,
an update language that can express conditionals and iteration
over complex hierarchical datatypes. Updates expressed
in BarQL are not evaluated, but rather appended as-is to the
log. This simplifies the semantics of out-of-order appends
and makes it easier to express updates as increments (i.e.,
deltas) rather than fixed write operations (e.g., var := 3).
In short, BarQL allows the operational semantics of updates
to be managed as first-class data objects.
Although an append-only log is a useful high-level abstrac- 
tion, in practice it becomes necessary to compact the log to 
bound its size. For example, a snapshot of the application 
state can be substituted for all log entries that precede it. 
Unfortunately, eliminating all log entries preceding the snapshot
also invalidates all clients at states preceding the snapshot.
These clients must be (effectively) restarted
from scratch, negating the benefits of a log.
In this paper, we present a general framework for reason- 
ing about log updates. We consider two properties of each 
rewrite: (1) Correctness, or whether the rewritten log up- 
dates collectively generate a state identical to the original 
sequence of updates, and (2) Recoverability, or whether the 
rewritten log can be used to bring a client at any state up to 
the most recent state. We then proceed to show how to define 
incremental deletion and composition rewrites of the log and 
provide realistic "real-world" bounds on their behavior. This 
is accomplished through the definition and use of mutation 
languages in the following sections. 
The contributions of this paper are as follows: 

1 Log-As-A-Service InfrastructurE
1. The design and formalism of mutation languages, a
general framework for reasoning about the correctness 
and recoverability of log rewrites, and an analysis of 
the computational complexity of doing so. 

2. The construction of mutation languages for compos- 
ite hierarchical datatypes derived from mutation lan- 
guages for simpler primitive types. 

3. The formal definition of a log-based update language
named BarQL.

4. A reduction from BarQL to a composite mutation language,
and a computational complexity result for computing
the correctness and recoverability of log rewrites
for BarQL.

5. An incremental algorithm for identifying candidate log
rewrites belonging to two rewrite classes, deletion
and composition, with amortized constant time complexity.

1.1 Roadmap 

Our ultimate goal in this paper is to demonstrate the con- 
struction of a practical log rewrite oracle for a non-trivial 
update language for composite types. For any given rewrite 
of a log, this oracle will determine both the correctness and 
recoverability of the rewrite. Before defining the oracle, we
first define in Section 2 a specification of a nontrivial update
language (BarQL). We then use this language to formalize
the notion of update logs and log rewrites, and provide formal
definitions of the correctness and recoverability of a log
rewrite.

In Section 3 we formally define the mutation and mutation
language abstractions. A mutation is simply an expression
of change, and a mutation family is a collection of mutations
with properties (e.g., commutativity). We also identify two
binary operations (merge and compose) over mutations in a
mutation language that we will use to simplify the translation
of BarQL update queries into equivalent mutations.
Section 3.1 outlines the construction of a log rewrite oracle
for any mutation language. This construction is based
on language-specific oracles that evaluate algebraic properties
of updates (commutativity, subsumption, and idempotence).

In Section 4 we define a mutation language LBar, and
show a reduction from BarQL to LBar. We provide definitions
of merge and compose, as well as impractical definitions of the
algebraic property oracles. Using LBar, we define a practical
set of algebraic property oracles that allow us to construct a
log rewrite oracle for BarQL.

2. HIGH LEVEL SEMANTICS 

In this section we introduce BarQL, a log-based update language
loosely based on the Monad Algebra [25] with unions
and aggregates. Unlike Monad Algebra, which uses sets as
the base collection type, BarQL uses maps² and has weaker
type semantics along the lines of [11]. Furthermore, BarQL
is intentionally limited to operations with linear computational
complexity in the size of the input data; neither the
pairwith nor cross-product operations of Monad Algebra are
included. In our domain, this is not a limitation, as the server
is acting primarily as a relay for state. Full cross-products
can be transmitted to clients more efficiently in their factorized
form, and each client is expected to be capable of
computing cross products locally³.

The domains and grammar for BarQL are given in Fig. 1.
We use c to range over constants, p over primitives (strings,
integers, floats, and booleans), k over keys, Q over queries,
$\tau$ over types, v over values of type $\tau$, and $\theta$ over binary
operations over primitive types. The type $\tau$ operated over by
BarQL queries is identical to the labeled trees of [11], and is
equivalent to unstructured XML or JSON. Values are either
of primitive type, null, or collections (mappings from k to $\tau$).
Note that collections are total mappings; for instance, a singleton
can be defined as the collection where all keys except
one map to the null value. By convention, when referring
to collections we will implicitly assume the presence of this
mapping for all keys that are not explicitly specified in the
rules themselves.

We formalize BarQL in Fig. 2 in terms of a big-step operational
semantics. Order of evaluation is defined by the structure
of the rules.

In BarQL, queries are monads, structures that represent computation.
Reducing the query corresponds to evaluating the
computation expressed by that query. The rules for PrimitiveConstant,
Null, and EmptySet all define operations that take
an input value and produce a constant value regardless of
input. The rule PrimitiveConstant produces a primitive constant
c, the rule Null produces the null value, and the rule
EmptySet produces an empty set. We define an empty set as a
collection that is a total mapping where all keys map to the
null value, represented as: {* -> null}. The Identity operation
passes through the input value unchanged. Subscripting
and Singleton are standard operations. In comparison to
Monad Algebra, these operations correspond to not only the
singleton operation over sets, but also the tuple constructor
and projection operations. Because collection elements are
identified by keys, we can reference specific elements of the
collection in much the same way as selection from a tuple.
The most significant way in which BarQL differs from Monad
Algebra is its use of the Merge operation (<=) instead of set
union (∪). The <= operator combines two collections, overwriting
undefined entries (keys for which the collection maps to null)
with their values from the other collection.

({A := 1} <= {B := 2})(null) = {A -> 1, B -> 2}



2 Maps are also popularly referred to as hashes, dictionaries, or 
lookup tables. 

3 Joins are an area of concern, however, and future work will
consider extensions to BarQL for this purpose.

Figure 1: Domains and grammar for BarQL.



If a key is defined in both collections, the right collection 
takes precedence. 

({A := 1} <= {A := 2})(null) = {A -> 2} 

The merge operator can be combined with singleton and 
identity to define updates to collections: 

(id <= {A := 3})({A -> 1, B -> 2}) = {A -> 3, B -> 2}

Subscripting can be combined with merge, singleton, and 
identity to define point modifications to collections. 

(id <= {A := (id.A <= {B := 2})})({A -> {C -> 1}})

= {A -> {B -> 2, C -> 1}}

Primitive binary operators are defined monadically with op- 
eration PrimBinOp, and include basic arithmetic, compar- 
isons, and boolean operations. These operations can be com- 
bined with identity, singleton, and merge to define updates. 
For example, to increment A by 1, we write 

(id <= {A := id.A + 1})({A -> 2}) = {A -> 3}

BarQL provides constructs for mapping, flattening, and aggregation.
The Map operation is analogous to its definition
in Monad Algebra, save that key names are preserved. The
Flatten operation is also similar, except that it uses <= instead
of ∪ as in Monad Algebra. The Primitive Aggregation
class of operators defines aggregation using any closed binary
operator over a primitive type.
To increment all children of the root by 1, we write:

(map id using (id + 1))({A -> 1, B -> 2}) = {A -> 2, B -> 3}

To increment the child C of each child of the root by 1, we
write

(map id using (id <= {C := id.C + 1}))(

{A -> {C -> 1}, B -> {C -> 2, D -> 1}}

) = {A -> {C -> 2}, B -> {C -> 3, D -> 1}}
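The examples in this section can be reproduced with a small interpreter. The following Python sketch is only an illustration under stated assumptions, not the paper's implementation: queries are modeled as Python functions from an input value to a result, collections are dicts in which a missing key stands for null (None), and the names const, ident, subscript, singleton, merge, binop, and map_using are hypothetical.

```python
import operator
from typing import Any, Callable

Value = Any                        # primitive, None (null), or dict (collection)
Query = Callable[[Value], Value]   # a query maps an input value to a result


def const(c: Value) -> Query:
    """PrimitiveConstant: ignore the input and produce c."""
    return lambda v: c


def ident(v: Value) -> Value:
    """Identity: pass the input value through unchanged."""
    return v


def subscript(q: Query, key: str) -> Query:
    """Subscript (Q.k): read one key; absent keys map to null."""
    def run(v: Value) -> Value:
        coll = q(v)
        return coll.get(key) if isinstance(coll, dict) else None
    return run


def singleton(key: str, q: Query) -> Query:
    """Singleton ({k := Q}): every key except `key` maps to null."""
    return lambda v: {key: q(v)}


def merge(q1: Query, q2: Query) -> Query:
    """Merge (Q1 <= Q2): non-null entries on the right take precedence."""
    def run(v: Value) -> Value:
        left, right = q1(v) or {}, q2(v) or {}
        out = {k: r for k, r in left.items() if r is not None}
        out.update({k: r for k, r in right.items() if r is not None})
        return out
    return run


def binop(op: Callable[[Any, Any], Any], q1: Query, q2: Query) -> Query:
    """PrimBinOp (Q1 theta Q2): apply op to the results of both queries."""
    return lambda v: op(q1(v), q2(v))


def map_using(q_coll: Query, q_map: Query) -> Query:
    """Map: apply q_map to each non-null element, preserving key names."""
    def run(v: Value) -> Value:
        coll = q_coll(v) or {}
        return {k: q_map(x) for k, x in coll.items() if x is not None}
    return run


# (id <= {A := id.A + 1})({A -> 2})
increment_a = merge(
    ident, singleton("A", binop(operator.add, subscript(ident, "A"), const(1)))
)
print(increment_a({"A": 2}))

# (map id using (id + 1))({A -> 1, B -> 2})
increment_all = map_using(ident, binop(operator.add, ident, const(1)))
print(increment_all({"A": 1, "B": 2}))
```

Representing null as an absent key keeps the "total mapping" convention implicit, which is a design choice of this sketch rather than something the semantics requires.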

Finally, BarQL supports Conditionals and Filtering, as well
as Composition of queries.

3. MUTATION LANGUAGES 

We will now temporarily step back from BarQL in order
to refine our understanding of update logs. At its simplest,
an update log encodes a state value as a sequence of state
mutations applied iteratively, first to a default "empty" state,
and then to the output of the prior transformation.



If the fundamental primitive of an update log is the state 
transformation, then the fundamental operation is compo- 
sition of state mutations. As a basis for reasoning about 
the safety properties of changes to this log, we begin with 
an outline for simple algebras over the composition of state 
mutations. 

DEFINITION 1. A mutation is an arbitrary transformation
$M : \tau \to \tau$ mapping values of some state type $\tau$ to new
values of the same type. A mutation may be parameterized
by a set of additional values $R$. We write such a mutation
as $M_R(v)$. We say the mutation $M_R(v)$ is:

• ... destructive if $M_R$ is independent of $v$.

• ... idempotent if $\forall v, R : M_R(v) = M_R(M_R(v))$

EXAMPLE 1. Consider an application that encodes its
state as a single integer (i.e., $\tau = \mathbb{Z}$). Such an application
might employ the two mutations "replace by 0" and "increment
by 1":

$M_{:=0}(x) \mapsto 0 \qquad M_{++}(x) \mapsto x + 1$

The replace operation is both destructive and idempotent.
The increment operation is neither.

We can use parameters to create families of mutations. For
example, we can use a single parameter $Y$ to define a family
of mutations "replace by $Y$" ($M_{:=Y}$), or "increment by $Y$"
($M_{+=Y}$).
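The two mutation families from Example 1 can be written down directly. The Python sketch below is an illustration only: it models mutations over the integers as plain functions, and the helper names replace_by, increment_by, looks_destructive, and looks_idempotent are hypothetical. Because it tests the properties on a finite sample of inputs, a True result is evidence rather than proof.

```python
from typing import Callable, Iterable

Mutation = Callable[[int], int]    # the state type is the integers


def replace_by(y: int) -> Mutation:
    """The family M_{:=Y}: overwrite the state with y."""
    return lambda x: y


def increment_by(y: int) -> Mutation:
    """The family M_{+=Y}: add y to the state."""
    return lambda x: x + y


def looks_destructive(m: Mutation, samples: Iterable[int]) -> bool:
    """True if m produces the same output for every sampled input."""
    return len({m(x) for x in samples}) <= 1


def looks_idempotent(m: Mutation, samples: Iterable[int]) -> bool:
    """True if applying m twice equals applying it once on every sample."""
    return all(m(m(x)) == m(x) for x in samples)


SAMPLES = list(range(-5, 6))
for name, m in [("replace by 0", replace_by(0)), ("increment by 1", increment_by(1))]:
    print(name, looks_destructive(m, SAMPLES), looks_idempotent(m, SAMPLES))
```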

Having defined mutations in the abstract as functions, we 
can now formally define the abstract composition of muta- 
tions as simple left-first function composition. 

$(M \circ M')(x) = M'(M(x))$

PROPOSITION 1. Composition is associative.
Proof. By equivalence:

$((M \circ M') \circ M'')(x) = M''(M'(M(x))) = (M \circ (M' \circ M''))(x)$

□

We can define a composition algebra for any set of mutations
$\mathcal{M}$ with identical kinds. We consider two properties in this
algebra: (1) pairwise commutativity and (2) subsumption.
Unlike the traditional algebraic notion of commutativity, we
consider only the pairwise commutativity of individual mutations.
That is, instead of saying that $\circ$ is commutative, we
say that $M$ and $M'$ commute iff $(M \circ M') = (M' \circ M)$. Subsumption
is also defined pairwise; we say that $M'$ subsumes
$M$ iff $M \circ M' = M'$.
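Left-first composition and the two pairwise properties can be checked mechanically on sample inputs. The sketch below is a hedged illustration: compose, agree, commute, and subsumes are hypothetical names, mutations are plain functions over the integers, and the checks are sample-based, so they can refute a property but cannot prove it.

```python
from typing import Callable, Iterable

Mutation = Callable[[int], int]


def compose(m: Mutation, m2: Mutation) -> Mutation:
    """Left-first composition: (m o m2)(x) = m2(m(x))."""
    return lambda x: m2(m(x))


def agree(f: Mutation, g: Mutation, samples: Iterable[int]) -> bool:
    """True if f and g produce the same result on every sample."""
    return all(f(x) == g(x) for x in samples)


def commute(m: Mutation, m2: Mutation, samples: Iterable[int]) -> bool:
    """M and M' commute iff (M o M') = (M' o M)."""
    return agree(compose(m, m2), compose(m2, m), samples)


def subsumes(m2: Mutation, m: Mutation, samples: Iterable[int]) -> bool:
    """M' subsumes M iff (M o M') = M'."""
    return agree(compose(m, m2), m2, samples)


SAMPLES = list(range(-5, 6))
set0 = lambda x: 0          # "replace by 0"
inc1 = lambda x: x + 1      # "increment by 1"
inc2 = lambda x: x + 2      # "increment by 2"

print(commute(inc1, inc2, SAMPLES))   # two increments
print(commute(set0, inc1, SAMPLES))   # replace vs. increment
print(subsumes(set0, inc1, SAMPLES))  # does the replace subsume the increment?
```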






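PrimitiveConstant: $c(v) \mapsto c$

Null: $null(v) \mapsto null$

EmptySet: $\emptyset(v) \mapsto \{* \to null\}$

Identity: $id(v) \mapsto v$

Subscript: if $Q(v) \mapsto \{\ldots, k \to r, \ldots\}$, then $(Q.k)(v) \mapsto r$

Singleton: if $Q(v) \mapsto r$, then $\{k := Q\}(v) \mapsto \{k \to r, * \to null\}$

Merge: if $Q_1(v) \mapsto \{k_i \to r_i\}$ and $Q_2(v) \mapsto \{k_j \to r_j\}$, then
$(Q_1 <= Q_2)(v) \mapsto \{k \to r \mid (k = k_i = k_j) \wedge (((r = r_i) \wedge (r_j = null)) \vee ((r = r_j) \wedge (r_j \neq null)))\}$

Map: if $Q_{coll}(v) \mapsto \{k_i \to v_i\}$ and $Q_{map}(v_i) \mapsto r_i$, then
$(\text{map } Q_{coll} \text{ using } Q_{map})(v) \mapsto \{k_i \to r_i \mid v_i \neq null\}$

PrimBinOp: if $Q_1(v) \mapsto r_1 : p$, $Q_2(v) \mapsto r_2 : p$, and $\theta \in \{+, *, -, /, =, AND, OR, \ldots\}$, then
$(Q_1 \mathbin{\theta} Q_2)(v) \mapsto r_1 \mathbin{\theta} r_2$

Flatten: if $Q_{coll}(v) \mapsto \{k_i \to v_i\}$, then
$(\text{agg}[<=](Q_{coll}))(v) \mapsto (v_0 <= v_1 <= \ldots)$

PrimitiveAggregate: if $Q_{coll}(v) \mapsto \{k_i \to v_i\}$, then
$(\text{agg}[\theta](Q_{coll}))(v) \mapsto (((v_0 \mathbin{\theta} v_1) \mathbin{\theta} v_2) \mathbin{\theta} \ldots)$

IfThenElse: if $Q_{cond}(v) \mapsto true$ and $Q_{then}(v) \mapsto r_{then}$, then
$(\text{if } Q_{cond} \text{ then } Q_{then} \text{ else } Q_{else})(v) \mapsto r_{then}$;
if $Q_{cond}(v) \mapsto false$ and $Q_{else}(v) \mapsto r_{else}$, then
$(\text{if } Q_{cond} \text{ then } Q_{then} \text{ else } Q_{else})(v) \mapsto r_{else}$

Filter: if $Q_{coll}(v) \mapsto \{k_i \to v_i\}$ and $Q_{cond}(v_i) \mapsto t_i$, then
$(\text{filter } Q_{coll} \text{ using } Q_{cond})(v) \mapsto \{k_i \to v_i \mid t_i \wedge v_i \neq null\}$

Composition: if $Q_1(v) \mapsto r_1$ and $Q_2(r_1) \mapsto r_2$, then $(Q_1 \circ Q_2)(v) \mapsto r_2$

Figure 2: A formal operational semantics for BarQL.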



DEFINITION 2. A mutation language $L$ is the 4-tuple
$(\tau, \mathcal{M}, S, C)$ consisting of:

1. A state type $\tau$.

2. A set of mutations $\mathcal{M}$ of kind $\tau \to \tau$. This set must include
the identity mutation $id(x) \mapsto x$.

3. A binary relation $S(M, M')$ that holds if $M$ is subsumed
by $M'$.

4. A symmetric binary relation $C(M, M')$ that holds if $M$
commutes with $M'$.

We will use the shorthand $S(M) = S(M, M)$ to denote the
unary idempotence relation.

A mutation language encapsulates the composition algebra 
for a specific set of mutations, together with a set of rules 
for determining idempotence, pairwise commutativity, and 
subsumption on mutations in the language. 

EXAMPLE 2. On simple mutation languages, these properties
can be determined quite efficiently. For the mutation
language defined from the mutation families in Example 1
($M_{:=Y}$ and $M_{+=Y}$), we can define the commutativity
and subsumption relations by simple structural tests on the
mutations being related: $C(M_{:=Y}, M_{:=Y})$, $C(M_{+=Y}, M_{+=Y'})$,
and $S(M, M_{:=Y})$ are the only relations that hold. The identity
mutation for this language is $M_{+=0}$.
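The structural tests of Example 2 amount to inspecting the kind of each mutation. The following Python sketch is one possible encoding, offered as an assumption rather than the paper's representation: a mutation is a tagged pair, ("set", Y) for $M_{:=Y}$ and ("add", Y) for $M_{+=Y}$, and the names apply, commutes, subsumed_by, and idempotent are hypothetical.

```python
from typing import Tuple

Mut = Tuple[str, int]              # ("set", Y) is M_{:=Y}; ("add", Y) is M_{+=Y}
IDENTITY: Mut = ("add", 0)         # the identity mutation M_{+=0}


def apply(m: Mut, x: int) -> int:
    """Evaluate a toy mutation on an integer state."""
    kind, y = m
    return y if kind == "set" else x + y


def commutes(m: Mut, m2: Mut) -> bool:
    """C(M, M'): two increments, or two replacements by the same value."""
    if m[0] == "add" and m2[0] == "add":
        return True
    return m[0] == "set" and m == m2


def subsumed_by(m: Mut, m2: Mut) -> bool:
    """S(M, M'): any mutation is subsumed by a following replacement."""
    return m2[0] == "set"


def idempotent(m: Mut) -> bool:
    """Shorthand S(M) = S(M, M)."""
    return subsumed_by(m, m)


print(commutes(("add", 1), ("add", 5)), commutes(("set", 3), ("add", 1)))
print(subsumed_by(("add", 1), ("set", 2)), idempotent(("add", 1)))
```

These tests are conservative in the sense used for weak mutation languages: for instance, idempotent(IDENTITY) is False even though adding 0 twice is harmless.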



For more complex classes of mutations, this definition can
be too strong. Consequently, for the remainder of the paper,
we will limit ourselves to weak mutation languages, where
the relations S, C are conservative approximations. If the
relation holds then the corresponding property is guaranteed
to hold, but not vice versa.

Finally, we will define two notions of closure for a mutation
language. First, a mutation language $L$ is closed over composition
if the composition of two mutations $M, M' \in L$ is
also in $L$.

$\forall M, M' \in L : \exists M'' = (M \circ M') \in L$

Second, a mutation language $L$ is closed over a binary operation
$\theta : \tau \times \tau \to \tau$ if there exists a mutation in $L$ that computes
the result of applying $\theta$ to the output of mutations $M, M' \in L$.

$\forall M, M' \in L : \exists M'' \in L : M''(x) = (M(x) \mathbin{\theta} M'(x))$

EXAMPLE 3. Our toy mutation language from Example 2
can be shown to be closed over composition, addition, and
subtraction, but not multiplication. The details of this proof
are left to the reader.

We will use the two binary operations compose and merge$_\theta$
to denote the result of combining two mutations by composition
or by binary operation (respectively), for any mutation
language closed over composition or $\theta$ (respectively). Note
that the existence of either function provably demonstrates
the corresponding type of closure.
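For the toy language, a compose function can be given explicitly, which demonstrates closure over composition. The sketch below covers composition only and assumes a hypothetical tagged-pair encoding, ("set", Y) for $M_{:=Y}$ and ("add", Y) for $M_{+=Y}$; the names apply and compose are illustrative.

```python
from typing import Tuple

Mut = Tuple[str, int]              # ("set", Y) is M_{:=Y}; ("add", Y) is M_{+=Y}


def apply(m: Mut, x: int) -> int:
    """Evaluate a toy mutation on an integer state."""
    kind, y = m
    return y if kind == "set" else x + y


def compose(m: Mut, m2: Mut) -> Mut:
    """Return one toy mutation equivalent to applying m first, then m2."""
    kind, y = m
    kind2, y2 = m2
    if kind2 == "set":
        return m2                  # a later replacement hides whatever came before
    # m2 is an increment: shift the replaced value, or add the two increments
    return (kind, y + y2)


print(compose(("set", 1), ("add", 2)))
print(compose(("add", 2), ("add", 3)))
print(compose(("add", 2), ("set", 9)))
```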

3.1 Mutation Logs 

We now turn to our primary subject: logs. Our goal in this 
section is to develop formalisms, first for the logs them- 
selves, and second for reasoning about how the logs can be 
transformed, or rewritten, while preserving certain critical 
properties. 

A log is a sequence of updates to an application's state, expressed
as a numbered sequence of mutations: $M_1, \ldots, M_n$.
A log defines a corresponding sequence of application states:
$v_0, \ldots, v_n$. We obtain state $v_i$ by starting with a default state
$v_0$, and applying mutations $M_1, \ldots, M_i$ in order. In other
words, for a mutation language closed under composition,
$v_i$ is the result of composing the first $i$ mutations in the log.

$v_i = (M_1 \circ \ldots \circ M_i)(v_0)$

We refer to the subscript of a state or mutation as its timestamp
(i.e., $v_i$ and $M_i$ have timestamp $i$). We define the current
state of a log of size $n$ to be the state $v_n$. The current
state can be recovered from any intermediate state $v_i$ by applying
the composition of all mutations after $i$.

$v_n = (M_{i+1} \circ \ldots \circ M_n)(v_i)$

Recovery is central to the design of Laasie. A client can re- 
cover from a transient disconnection by replaying only those 
mutations that occurred while the client was disconnected, 
rather than forcing it to reload the full application state from 
scratch. 
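Replay-based recovery is straightforward to state in code. The following Python sketch is illustrative only: it reuses a hypothetical tagged-pair encoding of the toy mutations, takes the default state $v_0$ to be 0, and stores mutation $M_i$ at list index i - 1. The names states and recover are assumptions of the sketch.

```python
from typing import List, Tuple

Mut = Tuple[str, int]              # ("set", Y) is M_{:=Y}; ("add", Y) is M_{+=Y}


def apply(m: Mut, x: int) -> int:
    """Evaluate a toy mutation on an integer state."""
    kind, y = m
    return y if kind == "set" else x + y


def states(log: List[Mut], v0: int = 0) -> List[int]:
    """Return [v_0, ..., v_n], where v_i is the state at timestamp i."""
    out = [v0]
    for m in log:
        out.append(apply(m, out[-1]))
    return out


def recover(log: List[Mut], i: int, v_i: int) -> int:
    """Bring a client holding state v_i up to v_n by replaying M_{i+1}..M_n."""
    v = v_i
    for m in log[i:]:              # log[i] holds M_{i+1}
        v = apply(m, v)
    return v


log = [("set", 1), ("add", 2), ("add", 3)]
vs = states(log)
print(vs)
print([recover(log, i, vs[i]) for i in range(len(vs))])
```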

3.1.1 Log Rewrites 

A log rewrite $\mathfrak{R}$ is defined generally as an operation that
transforms one sequence of mutations $M_1, \ldots, M_n$ into a new
sequence $M'_1, \ldots, M'_{n'}$.

Because of our interest in recovery, we are interested in preserving
a correspondence between timestamps in the pre-
and post-rewrite states $v_i$ and $v'_i$ (respectively). Consequently,
we will assume that each pre-rewrite state corresponds
to the post-rewrite state with the same timestamp.
Note that this limits us to size-preserving rewrites. As we
will soon see, this can be done without loss of generality.
We specifically consider two classes of size-preserving log
rewrites: delete and compose.

Delete. We can effect a size-preserving deletion rewrite by
replacing the deleted mutation with the no-op identity operation
(id). The rewrite $\mathfrak{R}_{del}(x)$, which deletes mutation $M_x$,
is defined as

$M'_i = \begin{cases} M_i & \text{if } i \neq x \\ id & \text{if } i = x \end{cases}$

Compose. For a mutation language closed over composition,
we can merge two mutations in the log into a single
log entry. The log size is preserved by inserting an id mutation.
For reasons that will soon become clear, the composite
mutation replaces the mutation with the higher timestamp,
and the inserted id replaces the mutation with the lower timestamp.
The rewrite $\mathfrak{R}_{cmp}(x, y)$, which merges mutations $M_x$
and $M_y$, is defined as

$M'_i = \begin{cases} M_i & \text{if } i \notin \{x, y\} \\ id & \text{if } i = x \\ M_x \circ M_y & \text{if } i = y \end{cases}$
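Both rewrites can be sketched as list transformations. The code below is a hedged illustration over the toy language, using a hypothetical tagged-pair encoding in which ("add", 0) plays the role of id; the function names rewrite_delete and rewrite_compose are assumptions, and timestamps are 1-indexed as in the text.

```python
from typing import List, Tuple

Mut = Tuple[str, int]              # ("set", Y) is M_{:=Y}; ("add", Y) is M_{+=Y}
ID: Mut = ("add", 0)               # the no-op identity mutation


def compose(m: Mut, m2: Mut) -> Mut:
    """Return one toy mutation equivalent to applying m first, then m2."""
    if m2[0] == "set":
        return m2
    return (m[0], m[1] + m2[1])


def rewrite_delete(log: List[Mut], x: int) -> List[Mut]:
    """Replace M_x with id, leaving every other entry untouched."""
    return [ID if i == x else m for i, m in enumerate(log, start=1)]


def rewrite_compose(log: List[Mut], x: int, y: int) -> List[Mut]:
    """Put M_x o M_y at timestamp y and id at timestamp x (requires x < y)."""
    out = list(log)
    out[y - 1] = compose(log[x - 1], log[y - 1])
    out[x - 1] = ID
    return out


log = [("set", 1), ("add", 2), ("add", 3)]
print(rewrite_delete(log, 2))
print(rewrite_compose(log, 2, 3))
```

Both functions return a new list of the same length, matching the size-preserving requirement; whether a given rewrite is safe to apply is a separate question.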

3.1.2 Rewrite Properties 

Now that we have defined log rewrites, we begin to consider
what constitutes a legitimate log rewrite. We define three
correctness properties for log rewrites: tail-correctness,
recoverability, and T-recoverability. We will also show how to
use the subsumption and commutativity relations $S$, $C$ of a
mutation language to determine when these properties are
guaranteed to be satisfied, independent of data, for a delete,
compose, or commute rewrite.

Tail-Correctness. We start with the simplest of the log- 
rewrite properties. 

DEFINITION 3. A log rewrite is tail-correct if the current
state $v_n$ of the log is identical to the current state $v'_n$ of the
rewritten log. That is:

$(M_1 \circ \ldots \circ M_n)(v_0) = (M'_1 \circ \ldots \circ M'_n)(v_0)$

LEMMA 1. The rewrite $\mathfrak{R}_{del}(x)$ is tail-correct if $M_x$ is
subsumed by the aggregate composition of all mutations following
it: $S(M_x, (M_{x+1} \circ M_{x+2} \circ \ldots \circ M_n))$.

Proof. The identity operation has no effect on the state,
and can be inserted anywhere. By subsumption, we have that

$M_x \circ \ldots \circ M_n = M_{x+1} \circ \ldots \circ M_n$

Thus, $v_n = v'_n$ □

LEMMA 2. The rewrite $\mathfrak{R}_{cmp}(x, y)$ is tail-correct for any
mutation language closed over composition if $M_x$ commutes
with the aggregate composition of all mutations between it
and $M_y$: $C(M_x, (M_{x+1} \circ \ldots \circ M_{y-1}))$

Proof. As before, identity has no effect on the state. If
$x = y - 1$, then the merged mutation is equivalent to the
separate mutations by Proposition 1. Otherwise, by commutativity,
we have that

$M_x \circ \ldots \circ M_{y-1} = M_{x+1} \circ \ldots \circ M_{y-1} \circ M_x$

Once $M_x$ and $M_y$ are adjacent, they can be merged just as
before. □

EXAMPLE 4. Consider our toy mutation language from
Example 2. From the subsumption relation $S$, we can infer
that it is tail-correct to delete any mutation preceding a
replace mutation ($M_{:=Y}$).

From the commutativity relation $C$, we can infer that it is
tail-correct to merge any two mutations in an unbroken sequence
of increment mutations ($M_{+=Y}$), or to merge a replace
mutation with its immediate successor.
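The sufficient conditions of Lemmas 1 and 2 can be evaluated on the toy language by composing the relevant log suffix or infix and then applying the structural tests. This Python sketch is an illustration under assumptions: the tagged-pair encoding and all function names are hypothetical, and a False result only means the (conservative) condition does not hold, not that the rewrite is necessarily incorrect.

```python
from functools import reduce
from typing import List, Tuple

Mut = Tuple[str, int]              # ("set", Y) is M_{:=Y}; ("add", Y) is M_{+=Y}
ID: Mut = ("add", 0)


def apply(m: Mut, x: int) -> int:
    return m[1] if m[0] == "set" else x + m[1]


def compose(m: Mut, m2: Mut) -> Mut:
    """One toy mutation equivalent to applying m first, then m2."""
    return m2 if m2[0] == "set" else (m[0], m[1] + m2[1])


def compose_all(ms: List[Mut]) -> Mut:
    """Aggregate composition of a (possibly empty) run of mutations."""
    return reduce(compose, ms, ID)


def commutes(m: Mut, m2: Mut) -> bool:
    return (m[0] == "add" and m2[0] == "add") or (m[0] == "set" and m == m2)


def subsumed_by(m: Mut, m2: Mut) -> bool:
    return m2[0] == "set"


def delete_is_tail_correct(log: List[Mut], x: int) -> bool:
    """Lemma 1: S(M_x, M_{x+1} o ... o M_n), with 1-indexed timestamps."""
    return subsumed_by(log[x - 1], compose_all(log[x:]))


def compose_is_tail_correct(log: List[Mut], x: int, y: int) -> bool:
    """Lemma 2: adjacent entries, or C(M_x, M_{x+1} o ... o M_{y-1})."""
    between = log[x:y - 1]
    return not between or commutes(log[x - 1], compose_all(between))


def final_state(log: List[Mut], v0: int = 0) -> int:
    return reduce(lambda v, m: apply(m, v), log, v0)


log = [("add", 2), ("add", 3), ("set", 7), ("add", 1)]
print([delete_is_tail_correct(log, x) for x in range(1, 5)])
print(compose_is_tail_correct([("add", 1), ("set", 5), ("add", 3)], 1, 3))
```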

Recoverability. Although tail-correctness provides a useful
baseline for further discussions of log rewrites, it only takes
a single state, the current state, into consideration. As such, it
fails to capture any of the benefits of having a log in the first
place. We now consider a property that is strictly stronger
than tail-correctness, and which allows us to reason about
the possibility of recovery from any intermediate state. We
start with a per-timestamp notion of recoverability.

DEFINITION 4. A log rewrite is recoverable from timestamp
$i$ (or equivalently state $v_i$) if the final state $v_n$ of the
original log can be obtained by applying the sequence of
rewritten mutations following timestamp $i$ to the state $v_i$,
taken from the original log.

$(M_1 \circ \ldots \circ M_n)(v_0) = (M'_{i+1} \circ \ldots \circ M'_n)(v_i)$

Or equivalently (because $v_i$ is defined by the original log)

$(M_1 \circ \ldots \circ M_n)(v_0) = (M_1 \circ \ldots \circ M_i \circ M'_{i+1} \circ \ldots \circ M'_n)(v_0)$

DEFINITION 5. A log rewrite is recoverable if it is recoverable
from all timestamps in the log (i.e., $i \in [0, n]$).

Note that tail-correctness is the special case of recoverability
from timestamp 0.

LEMMA 3. If the log rewrite $\mathfrak{R}_{del}(x)$ is tail-correct, it is
recoverable.

Proof. Recoverability from any state $v_i$ s.t. $i < x$ is equivalent
to tail-correctness, because these states are unaffected
by the rewrite. Recoverability when $i \geq x$ is always guaranteed:
the state $v_i$ being recovered from is taken before the
rewrite, and mutations $M'_{x+1}, \ldots, M'_n$ are identical to their
pre-rewrite counterparts. □

This proof shows a tight coupling between correctness and 
recoverability, and illustrates an intriguing log partitioning. 
If a rewrite only modifies mutations that fall within a fixed 
range, recoverability "errors" can only occur at states that 
fall within that same range. 

PROPOSITION 2. Let $\mathfrak{R}$ be a tail-correct log rewrite that
only alters log entries at timestamps in the range $[x, y]$. Mutations
outside of this range are unaffected by $\mathfrak{R}$.

$\mathfrak{R}$ is recoverable iff it is recoverable from all states $v_i$ with $i \in [x, y)$.

Proof. The proof is identical to that of Lemma 3. □

LEMMA 4. The rewrite $\mathfrak{R}_{cmp}(x, y)$ is recoverable if it is
tail-correct, and if $M_x$ is idempotent: $S(M_x, M_x)$

Proof. From the commutativity property required to show
correctness, we have that $M_x \circ \ldots \circ M_{y-1} = M_{x+1} \circ \ldots \circ M_{y-1} \circ
M_x$. For all $i \geq x$, state $v_i = (M_x \circ \ldots \circ M_i)(v_{x-1})$. Thus,
$(M'_{i+1} \circ \ldots \circ M'_y)(v_i) = (M_x \circ \ldots \circ M_x \circ M_y)(v_{x-1})$. By commutativity,
we can rewrite this expression as $M_{x+1} \circ \ldots \circ M_x \circ M_x \circ M_y$.
By idempotence, this is equivalent to the original rewritten
expression, and by Proposition 2 the proof devolves to that
of correctness. □



EXAMPLE 5. Returning to the toy mutation language from
Example 2, we see that although it is tail-correct to merge
any two increment mutations, it is not recoverable.
Consider the log $(M_{:=1}, M_{+=2}, M_{+=3})$. After applying the
rewrite $\mathfrak{R}_{cmp}(2, 3)$, we get $(M_{:=1}, id, M_{+=5})$. After the rewrite,
it is no longer possible to recover from state $v_2$ (= 3), as the
mutation $M_{+=2}$ would effectively be applied twice.
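Example 5 can be checked by brute force: take each state of the original log, replay the rewritten suffix, and compare with the original final state. The sketch below assumes a hypothetical tagged-pair encoding of the toy mutations, with ("add", 0) standing in for id and a default state of 0; recoverable_from is an illustrative name.

```python
from typing import List, Tuple

Mut = Tuple[str, int]              # ("set", Y) is M_{:=Y}; ("add", Y) is M_{+=Y}


def apply(m: Mut, x: int) -> int:
    return m[1] if m[0] == "set" else x + m[1]


def states(log: List[Mut], v0: int = 0) -> List[int]:
    """Return [v_0, ..., v_n] for the given log."""
    out = [v0]
    for m in log:
        out.append(apply(m, out[-1]))
    return out


def recoverable_from(original: List[Mut], rewritten: List[Mut], i: int) -> bool:
    """Definition 4: replay M'_{i+1}..M'_n on the ORIGINAL state v_i."""
    vs = states(original)
    v = vs[i]
    for m in rewritten[i:]:
        v = apply(m, v)
    return v == vs[-1]


original = [("set", 1), ("add", 2), ("add", 3)]
rewritten = [("set", 1), ("add", 0), ("add", 5)]   # result of cmp(2, 3)

unrecoverable = [i for i in range(len(original) + 1)
                 if not recoverable_from(original, rewritten, i)]
print(unrecoverable)
```

A client that last saw $v_2$ = 3 would replay only the merged increment by 5 and arrive at 8 instead of 6, which is the double application described in the example.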

T-recoverability. The intent of recoverability is to protect
disconnected clients from reaching an inconsistent state
when log entries are replayed. However, to guarantee full
recoverability, we must discard many potentially useful log
rewrites. In a practical setting, a server will not need to guarantee
recoverability for all timestamps.

DEFINITION 6. Given a set of timestamps $T$, a log rewrite
is $T$-recoverable if it is recoverable from every $t \in T$.

By tracking when clients disconnect (regardless of whether
or not the disconnection is transient), the server can identify
ranges of log entries over which non-recoverable log rewrites
can still be performed.

THEOREM 1. Let $\mathfrak{R}$ be a tail-correct, but non-recoverable
log rewrite. Let $[x, y]$ be the minimal range of timestamps affected
by $\mathfrak{R}$. $\mathfrak{R}$ is $T$-recoverable iff $(T \cap [x, y)) = \emptyset$.

Proof. Follows from Proposition 2. □
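In practice, the condition of Theorem 1 is a simple range test against the timestamps at which clients are known to be waiting. The following Python sketch is illustrative; is_t_recoverable and the example client timestamps are assumptions of the sketch, not part of the formalism.

```python
from typing import Iterable


def is_t_recoverable(client_timestamps: Iterable[int], x: int, y: int) -> bool:
    """Theorem 1 test: no tracked timestamp falls in the half-open range [x, y)."""
    return not any(x <= t < y for t in client_timestamps)


# A tail-correct rewrite cmp(2, 3) affects the timestamp range [2, 3].
print(is_t_recoverable({0, 1, 3}, 2, 3))   # no client is waiting at timestamp 2
print(is_t_recoverable({2}, 2, 3))         # a client is waiting at timestamp 2
```

A server could run this test before applying a non-recoverable rewrite, and postpone the rewrite while any disconnected client still holds a pointer inside the affected range.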

4. REDUCING BarQL TO LBAR

We now apply the principles of mutation languages to BarQL
by constructing a weak mutation language LBar built
around BarQL. Roughly speaking, this mutation language
allows a single monolithic BarQL query to be subdivided
into a set of disjoint operations, each applied to a specific
point in the path hierarchy. This allows us to easily identify
the write dependencies of a BarQL query at their finest
granularity.

We then transform each subdivided operation into a delta
form, with a BarQL query that computes a delta value and a
merge operator, a binary function that defines how the delta
value is to be merged with the prior state. This update operator
simplifies the task of determining commutativity and
subsumption at a fine granularity.

We also identify the set of points in the path hierarchy that 
each query reads from. This set of points forms the set of 
read dependencies of the query. 

Finally, we use the sets of write dependencies, read dependencies,
and update operators to efficiently compute the commutativity
and subsumption relations C, S for a BarQL query.

4.1 LBar 

The type system of LBar is identical to that of BarQL. To recap:
values can be of any primitive type, or a collection, which
is a mapping from key names of abstract type k to values.
Collections can be organized into a hierarchy. We use $\phi$ to
denote an ordered sequence of key names that defines a path
through the collection hierarchy.

Point mutations form the basis of LBar, and express updates to
individual paths in a BarQL hierarchy. A point mutation is
a 3-tuple $\langle \phi, Q, (\theta \mid \emptyset) \rangle$, where $\phi$ is the path being updated
and $Q$ is a BarQL expression that computes an update delta
based on the prior state. Every point mutation is annotated
with either a binary operation $\theta$, or the overwrite annotation
$\emptyset$. The annotation indicates the combinator used to merge
the delta value with the original.

We say that two point mutations are path-disjoint if neither
point mutation's path is a prefix of the other's. A full mutation
in LBar is a set of pairwise path-disjoint point mutations,
which it applies to the state in parallel. The prior state for
all point mutations in the set is defined uniformly to be the
prior state for the full mutation. Thus, all point mutations are
guaranteed to be isolated in the traditional database sense.
As a shorthand, we will use $\omega(\mathfrak{M})$ to denote the write set of
a full mutation $\mathfrak{M}$, the set of all paths of point mutations in
the full mutation:

$\omega(\mathfrak{M}) = \{\phi \mid \langle \phi, Q, \theta \rangle \in \mathfrak{M}\}$

We will also use the shorthand $\mathfrak{M}_\phi$ to denote the point mutation
applied to path $\phi$ for all $\phi \in \omega(\mathfrak{M})$.
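Path-disjointness and the write set are easy to express once paths are modeled as tuples of key names. The sketch below is a minimal illustration with assumed names (PointMutation, is_prefix, path_disjoint, is_full_mutation, write_set); the query field is a placeholder string rather than a real BarQL expression, and None stands for the overwrite annotation.

```python
from itertools import combinations
from typing import Iterable, NamedTuple, Optional, Set, Tuple

Path = Tuple[str, ...]             # an ordered sequence of key names


class PointMutation(NamedTuple):
    path: Path
    query: str                     # placeholder for a BarQL delta expression
    annotation: Optional[str]      # a binary operator such as "+", or None for overwrite


def is_prefix(p: Path, q: Path) -> bool:
    """True if path p is a prefix of path q (every path prefixes itself)."""
    return q[:len(p)] == p


def path_disjoint(p: Path, q: Path) -> bool:
    """Neither path is a prefix of the other."""
    return not (is_prefix(p, q) or is_prefix(q, p))


def is_full_mutation(pms: Iterable[PointMutation]) -> bool:
    """A full mutation is a set of pairwise path-disjoint point mutations."""
    return all(path_disjoint(a.path, b.path) for a, b in combinations(list(pms), 2))


def write_set(pms: Iterable[PointMutation]) -> Set[Path]:
    """The write set: all paths touched by the point mutations."""
    return {pm.path for pm in pms}


full = [PointMutation(("A", "B"), "1", "+"), PointMutation(("A", "C"), "2", None)]
print(is_full_mutation(full), write_set(full))
```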

4.2 Reduction Algorithm 

We now present an iterative process for transforming BarQL
expressions into LBar form. This process begins by creating a
full mutation consisting of a single point mutation $\{\langle [], Q, \emptyset \rangle\}$.
The algorithm repeatedly selects an arbitrary point mutation
in the set and tries (1) to subdivide point mutations in this
set into finer-grained mutations, and (2) to replace overwrite
annotations by extracting binary operations from the point
mutation's query. This process proceeds up to a fixed point.

Operator-Extraction. In their simplest incarnations, both
transformations are applied to point queries of the same general
form:

$\langle \phi, (id.\phi \mathbin{\theta} Q'), \emptyset \rangle$

For a $\theta$ that is commutative and associative, any query with
an $id.\phi$ term can be commuted to the front.
In this expression, $Q'$ effectively expresses the delta of the
point update, while $\theta$ combines it with the original value
$id.\phi$. Consequently, $\theta$ becomes the new combinator, and $Q'$
becomes the new update delta.
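Operator extraction can be sketched as a pattern match on the query's syntax tree. The Python below is a rough illustration under several assumptions: queries are encoded as nested tuples with hypothetical tags ("read", path), ("const", c), and ("binop", op, left, right); a point mutation is a (path, query, annotation) triple with None for the overwrite annotation; and the set of operators treated as commutative and associative is chosen for the example only.

```python
from typing import Any, Optional, Tuple

Path = Tuple[str, ...]
Query = Tuple[Any, ...]
PointMutation = Tuple[Path, Query, Optional[str]]

# Operators assumed commutative and associative for this illustration.
COMMUTATIVE_ASSOCIATIVE = {"+", "*", "AND", "OR"}


def extract_operator(pm: PointMutation) -> PointMutation:
    """Turn <phi, (id.phi theta Q'), overwrite> into <phi, Q', theta> when possible."""
    path, query, annotation = pm
    if annotation is not None or query[0] != "binop":
        return pm                              # already in delta form, or not a binop
    _, op, left, right = query
    if left == ("read", path):
        return (path, right, op)               # id.phi is already at the front
    if right == ("read", path) and op in COMMUTATIVE_ASSOCIATIVE:
        return (path, left, op)                # commute id.phi to the front first
    return pm                                  # no extraction applies


increment = (("A",), ("binop", "+", ("read", ("A",)), ("const", 1)), None)
print(extract_operator(increment))
```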

Key-Extraction. The merge operator (<=) is associative
(but not commutative). As with binary operators on primitive
types, we can compute an update delta of expressions
that derive from id. We start by identifying the change set of
the original query. We start from a point update of the form:

$\langle \phi, Q, \emptyset \rangle$

If a query $Q$ returns a value of collection type, its change set
$\delta(Q)$ is computed as follows:



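• $\delta(\emptyset) = \emptyset$

• $\delta(\{k := Q'\}) = \{k\}$

• $\delta(id.\phi) = \{*\}$

• $\delta(id.\phi') =$ This point mutation cannot be subdivided.

• $\delta(\text{map } Q' \text{ using } \ldots) = \delta(Q')$

• $\delta(\text{filter } Q' \text{ using } \ldots) = \delta(Q')$

• $\delta(Q' <= Q'') = \delta(Q') \cup \delta(Q'')$

• $\delta(\text{if } \ldots \text{ then } Q' \text{ else } Q'') = \delta(Q') \cup \delta(Q'')$

• $\delta(Q' \circ Q'') = \delta(Q''[id/Q'])$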
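The change-set rules lend themselves to a recursive function over the query's syntax tree. The sketch below is a partial illustration under assumptions: queries are nested tuples with hypothetical tags, the composition rule (which needs substitution) is omitted, a read of any path other than the one being updated is taken to be the "cannot be subdivided" case, and that case is signalled by returning None.

```python
from typing import Any, Optional, Set, Tuple

Path = Tuple[str, ...]
Query = Tuple[Any, ...]
STAR = "*"                         # the special key standing for all input keys


def change_set(q: Query, phi: Path) -> Optional[Set[str]]:
    """Return the change set of q at path phi, or None if q cannot be subdivided."""
    tag = q[0]
    if tag == "empty":                         # the EmptySet query
        return set()
    if tag == "singleton":                     # ("singleton", k, Q')
        return {q[1]}
    if tag == "read":                          # ("read", path), i.e. id.path
        return {STAR} if q[1] == phi else None
    if tag in ("map", "filter"):               # ("map", Q', Q'') / ("filter", Q', Q'')
        return change_set(q[1], phi)
    if tag in ("merge", "if"):                 # ("merge", Q', Q'') / ("if", cond, Q', Q'')
        left, right = change_set(q[-2], phi), change_set(q[-1], phi)
        if left is None or right is None:
            return None
        return left | right
    raise ValueError(f"unsupported query form: {tag!r}")


# id <= {A := 3}, evaluated at the root path ()
update_a = ("merge", ("read", ()), ("singleton", "A", ("const", 3)))
print(change_set(update_a, ()))
```

Because the result for update_a contains the special key *, this query modifies the original value rather than overwriting it, so the point mutation is a candidate for further subdivision.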

The key * is a special key that refers to all keys in the
query input. This special key is treated as a distinct key
in the change set computation. If it is in the change set for a
delta query ($* \in \delta(Q)$), the point mutation modifies the original
value (instead of overwriting it), and can be subdivided
further as follows.

We begin by generating a delta computation $\Delta_k(Q)$ for each
subkey $k$ in the change set. This includes a delta computation
for the special key *, which will be applied to all keys in the
input that are not explicitly present in the change set.

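• $\Delta_k(\emptyset) = \mathit{null}$

• $\Delta_k(\{k := Q'\}) = Q'$

• $\Delta_k(\{k' := Q'\}) = \mathit{null}$

• $\Delta_k(id.\phi) = id.\phi.k$

• $\Delta_k(\mathrm{map}\ Q'\ \mathrm{using}\ Q'') = \Delta_k(Q') \circ Q''$

• $\Delta_k(\mathrm{filter}\ Q'\ \mathrm{using}\ Q'') =$

$\mathrm{if}\ \Delta_k(Q') \circ Q''\ \mathrm{then}\ \Delta_k(Q')\ \mathrm{else}\ \mathit{null}$

• $\Delta_k(Q' \Leftarrow Q'') =$

$\mathrm{if}\ \Delta_k(Q'') \neq \mathit{null}\ \mathrm{then}\ \Delta_k(Q'')\ \mathrm{else}\ \Delta_k(Q')$

• $\Delta_k(\mathrm{if}\ \Delta_k(Q)\ \mathrm{then}\ Q'\ \mathrm{else}\ Q'') =$

$\mathrm{if}\ \Delta_k(Q)\ \mathrm{then}\ \Delta_k(Q')\ \mathrm{else}\ \Delta_k(Q'')$

• $\Delta_k(Q' \circ Q'') = \Delta_k(Q''[id/Q'])$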

The resulting expression can be simplified by partial evaluation.
In many cases, it will be possible to eliminate operations
over null values. The result is a set of point mutations,
one for each key $k$ in the changeset, including the special
key $*$. Once again, $*$ applies to all children except those
explicitly defined (by being present in the changeset). The
resulting set of point mutations is thus defined as

$\{\langle \phi.k, \Delta_k(Q), \varnothing \rangle \mid k \in \delta(Q) \wedge (\Delta_k(Q) \neq id.\phi.k)\}$

Note that we explicitly exclude the identity mutation, as this 
is effectively a no-op. 
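To make key extraction concrete, the following Python sketch computes the changeset, the per-key deltas, and the resulting set of point mutations for a small fragment of the language (empty collection, singleton, $id.\phi$, and merge only). The tuple encoding, the helper names, the use of `None` for the overwrite annotation, and the assumption that constants are never null are illustrative choices, not part of the original definition.

```python
NULL = ("null",)
STAR = "*"          # special key: every key not named explicitly
OVERWRITE = None    # stands in for the overwrite annotation (assumption)

def changeset(q):
    """Changeset of Q for the fragment: empty, singleton, id.phi, merge."""
    tag = q[0]
    if tag == "empty":
        return set()
    if tag == "single":            # ("single", k, Q') encodes {k := Q'}
        return {q[1]}
    if tag == "id":                # id.phi
        return {STAR}
    if tag == "merge":             # ("merge", Q', Q'') encodes Q' <= Q''
        return changeset(q[1]) | changeset(q[2])
    raise ValueError(f"unsupported query: {tag}")

def delta(k, q):
    """Delta of Q at key k, partially evaluating merges where possible."""
    tag = q[0]
    if tag == "empty":
        return NULL
    if tag == "single":
        return q[2] if q[1] == k else NULL
    if tag == "id":
        return ("id_key", k)       # id.phi.k
    if tag == "merge":
        left, right = delta(k, q[1]), delta(k, q[2])
        if right == NULL:          # right side contributes nothing at k
            return left
        if right[0] == "const":    # constants are assumed to be non-null
            return right
        return ("if_not_null", right, left)
    raise ValueError(f"unsupported query: {tag}")

def subdivide(path, q):
    """Return the finer-grained point mutations, or None if Q overwrites."""
    keys = changeset(q)
    if STAR not in keys:
        return None
    result = set()
    for k in keys:
        d = delta(k, q)
        if d != ("id_key", k):     # drop the identity mutation
            result.add((path + (k,), d, OVERWRITE))
    return result
```

Under this encoding, merging `{a := 1}` into the original value at path `("doc",)` yields a single point mutation at `("doc", "a")`; the delta for `*` reduces to the identity and is dropped.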
4.3 Read Dependencies

We compute the read dependencies of a BarQL query by first
defining a read-normal form for BarQL. We call a query of
the form $id.k_1.k_2.(\ldots).k_n$ a point read at path
$\phi = k_1.k_2.(\ldots).k_n$. A query is in read-normal form if the
subscript operator appears only in point reads, or is applied
to the special key tmp, defined below. As we now show, any
valid query can be transformed into read-normal form:

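• $(Q \Leftarrow Q').k \mapsto \mathrm{if}\ Q'.k \neq \mathit{null}\ \mathrm{then}\ Q'.k\ \mathrm{else}\ Q.k$

• $(\mathrm{map}\ Q\ \mathrm{using}\ Q').k \mapsto Q.k \circ Q'$

• $(\mathrm{agg}[\Leftarrow](Q)).k \mapsto$

$(\mathrm{agg}[\Leftarrow](\mathrm{map}\ Q\ \mathrm{using}\ \{tmp := id.k\})).tmp$

• $(\mathrm{filter}\ Q\ \mathrm{using}\ Q').k \mapsto \mathrm{if}\ Q.k \circ Q'\ \mathrm{then}\ Q.k\ \mathrm{else}\ \mathit{null}$

• $(\mathrm{if}\ Q\ \mathrm{then}\ Q'\ \mathrm{else}\ Q'').k \mapsto \mathrm{if}\ Q\ \mathrm{then}\ Q'.k\ \mathrm{else}\ Q''.k$

• $(Q \circ Q').k \mapsto Q \circ (Q'.k)$

• $\emptyset.k \mapsto \mathit{null}$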
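The rewrite rules above can be illustrated with a short Python sketch that pushes subscripts inward until they appear only in point reads. It covers only the merge, conditional, and empty-collection rules; the tuple encoding and the `("not_null", x)` node (standing for the test that `x` is not null) are assumptions made for this example.

```python
def push(q, k):
    """Rewrite the subscript Q.k, where Q is already in read-normal form."""
    tag = q[0]
    if tag == "id":                  # ("id", path) encodes the point read id.path
        return ("id", q[1] + (k,))
    if tag == "merge":               # (Q <= Q').k
        right = push(q[2], k)
        return ("if", ("not_null", right), right, push(q[1], k))
    if tag == "if":                  # (if Q then Q' else Q'').k
        return ("if", q[1], push(q[2], k), push(q[3], k))
    if tag == "empty":               # subscript of the empty collection
        return ("null",)
    raise ValueError(f"unsupported query: {tag}")

def normalize(q):
    """Bring a query of the supported fragment into read-normal form."""
    tag = q[0]
    if tag == "sub":                 # ("sub", Q, k) encodes Q.k
        return push(normalize(q[1]), q[2])
    if tag == "merge":
        return ("merge", normalize(q[1]), normalize(q[2]))
    if tag == "if":
        return ("if", normalize(q[1]), normalize(q[2]), normalize(q[3]))
    return q                         # point reads, constants, null, empty
```

After normalization, no `"sub"` node remains in the supported fragment: every subscript has been absorbed into the path of a point read or eliminated.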

Given a query $Q$ in read-normal form, we can compute the
readset of the query $\rho(Q)$ as follows:

• $\rho(id.\phi) = \{\phi\}$

• $\rho(Q \Leftarrow Q') = \rho(Q) \cup \rho(Q')$

• $\rho(\mathrm{map}\ Q\ \mathrm{using}\ Q') = \rho(Q)$

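• $\rho(Q\ op_\theta\ Q') = \rho(Q) \cup \rho(Q')$

• $\rho(\mathrm{agg}[\theta](Q)) = \rho(Q)$

• $\rho(\mathrm{filter}\ Q\ \mathrm{using}\ Q') = \rho(Q)$

• $\rho(\mathrm{if}\ Q\ \mathrm{then}\ Q'\ \mathrm{else}\ Q'') = \rho(Q) \cup \rho(Q') \cup \rho(Q'')$

• $\rho(Q \circ Q') = \rho(Q'[id/Q])$

• $\rho(c \mid \mathit{null} \mid \emptyset) = \emptyset$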
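A readset computation over a query in read-normal form might look like the following Python sketch. It handles point reads, merge, binary operators, conditionals, and constants; composition is omitted because it requires substitution. The tuple encoding is an assumption made for illustration.

```python
def readset(q):
    """Set of paths read by a query in read-normal form (fragment only)."""
    tag = q[0]
    if tag == "id":                        # ("id", path): point read
        return {q[1]}
    if tag == "merge":                     # ("merge", Q, Q')
        return readset(q[1]) | readset(q[2])
    if tag == "op":                        # ("op", theta, Q, Q')
        return readset(q[2]) | readset(q[3])
    if tag == "if":                        # ("if", Q, Q', Q'')
        return readset(q[1]) | readset(q[2]) | readset(q[3])
    if tag in ("const", "null", "empty"):  # constants read nothing
        return set()
    raise ValueError(f"unsupported query: {tag}")
```

Paths are represented as tuples of keys, so that ancestor tests in later steps reduce to prefix comparisons.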

4.4 Subsumption and Commutativity 

We are now ready to complete the definition of the mutation 
language 4-tuple for Z by defining a conservative approxi- 
mation of the subsumption and commutativity relations. 

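Subsumption. A path $\phi$ is subsumed by a full mutation $\mathcal{M}$
if it or one of its ancestors is overwritten by $\mathcal{M}$, and neither
$\phi$ nor any of its ancestors or descendants appears in the read
set of $\mathcal{M}$. Abusing syntax, we write this as:

$S(\phi, \mathcal{M}) =$

$(\exists Q, \phi' \in \omega(\mathcal{M}) : (\phi' \sqsubseteq \phi) \wedge (\mathcal{M}[\phi'] = \langle \phi', Q, \varnothing \rangle))$
$\wedge\ (\nexists \phi' \in \rho(\mathcal{M}) : (\phi' \sqsubseteq \phi) \vee (\phi \sqsubseteq \phi'))$

Here, $\sqsubseteq$ denotes the ancestor-of relation, $\omega(\mathcal{M})$ is the
write set of $\mathcal{M}$, and $\varnothing$ in the third position of a point
mutation is the overwrite annotation.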
4 This is a conservative approximation. 



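A mutation $\mathcal{M}$ is subsumed by $\mathcal{M}'$ if all paths in the write
set of $\mathcal{M}$ are subsumed by $\mathcal{M}'$:

$S(\mathcal{M}, \mathcal{M}') = \forall \phi \in \omega(\mathcal{M}) : S(\phi, \mathcal{M}')$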
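The subsumption test can be sketched in Python as below. This is an illustration under assumptions: a full mutation is modeled as a dictionary from write paths (tuples of keys) to `(query, annotation)` pairs, `None` stands for the overwrite annotation, and the read set of the subsuming mutation is passed in precomputed.

```python
OVERWRITE = None   # stands in for the overwrite annotation (assumption)

def is_ancestor_or_self(a, b):
    """True if path a equals path b or is a proper prefix (ancestor) of it."""
    return b[:len(a)] == a

def path_subsumed(path, mutation, reads):
    """A path is subsumed if it or an ancestor is overwritten and not read."""
    overwritten = any(
        annotation is OVERWRITE and is_ancestor_or_self(written, path)
        for written, (_query, annotation) in mutation.items()
    )
    touched = any(
        is_ancestor_or_self(r, path) or is_ancestor_or_self(path, r)
        for r in reads
    )
    return overwritten and not touched

def mutation_subsumed(m, m_prime, m_prime_reads):
    """m is subsumed by m_prime if every path m writes is subsumed."""
    return all(path_subsumed(p, m_prime, m_prime_reads) for p in m)
```

For example, a mutation that writes `("doc", "title")` is subsumed by one that overwrites `("doc",)` without reading it, but not if the second mutation reads anything at, above, or below `("doc", "title")`.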

Commutativity. Two point mutations applied to the same
path $\phi$, $\langle \phi, Q, \theta \rangle$ and $\langle \phi, Q', \theta' \rangle$, commute iff $\theta$
commutes with $\theta'$. Two point mutations applied to different
paths, $\langle \phi, Q, \theta \rangle$ and $\langle \phi', Q', \theta' \rangle$, commute iff each of the
following conditions holds: (1) $\phi$ is neither an ancestor nor a
descendant of $\phi'$, (2) $\phi$ is neither an ancestor nor a descendant
of a path in the read set $\rho(Q')$, and (3) $\phi'$ is neither an
ancestor nor a descendant of a path in the read set $\rho(Q)$.

Two full mutations commute if all pairs of point mutations
commute. Again, abusing syntax:

$C(\mathcal{M}, \mathcal{M}') = \forall m \in \mathcal{M}, m' \in \mathcal{M}' : C(m, m')$
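The commutativity conditions translate into a short Python sketch. Here each point mutation is modeled as a `(path, readset, operator)` triple with the query's read set precomputed, and `ops_commute` is a caller-supplied oracle for operator commutativity; both are assumptions of this example. Treating a path that equals a read path as related is a conservative choice made here.

```python
def related(a, b):
    """True if a equals b, or one path is an ancestor of the other."""
    n = min(len(a), len(b))
    return a[:n] == b[:n]

def point_commute(m1, m2, ops_commute):
    """Conservative commutativity test for two point mutations."""
    (p1, reads1, op1), (p2, reads2, op2) = m1, m2
    if p1 == p2:
        return ops_commute(op1, op2)          # same path: compare operators
    if related(p1, p2):
        return False                          # condition (1)
    if any(related(p1, r) for r in reads2):
        return False                          # condition (2)
    if any(related(p2, r) for r in reads1):
        return False                          # condition (3)
    return True

def full_commute(m, m_prime, ops_commute):
    """Two full mutations commute if all pairs of point mutations do."""
    return all(point_commute(a, b, ops_commute) for a in m for b in m_prime)
```

A simple oracle for testing might accept only pairs of identical operators drawn from a known commutative set, for example `lambda a, b: a == b and a in {"+", "*"}`.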
5. RELATED WORK 

There has been much work focused on the formalization of
query languages and database models [3] [4] [26]. Much of
this work is based on monad algebra, Lawvere theories, and
universal algebra [23] [7] [5] [22]. Manes et al. [27] showed
how to implement collection classes using monads. Cluet [17]
is an algebra-based query language for an object-oriented
database system. Our work is based on the same fundamental
theories. In the following, we compare our work to previous
results.

Languages for Transforming Hierarchical Data. There
has been considerable work [12] [2] [3] Q] on the transformation
of hierarchical data. Two approaches have become dominant
in this area: Nested Relational Calculus OTI and the
Monad Algebra [25]. Our own approach is closely based on
the latter, adapted for use with labeled sets, and with the
intentional exclusion of the superlinear time complexity
pairwith operator (or equivalently, the Cartesian cross-product).

Semistructured Data. Also closely related is work on managing
semistructured data ifTTI . The vast majority of recent
efforts in this area have been on querying and transforming
XML data. One formalization by Koch [24] is also closely
based on Monad Algebra. Work by Cheney follows a similar
vein, in particular (F)LUX lfT31 [TBI , a functional language
for XML updates. In [8], Benedikt and Cheney present a
formalism for synthesizing the output schema of XML
transformations, similar to our notion of the compositional
compatibility of mutations. More recently, there has also been
interest in querying lighter-weight semistructured data
representations like JSON [9] [10].

Algebraic Properties of State Updates. The distributed
systems community has identified a number of algebraic
properties of state mutations that are useful in distributed
concurrency control. Commutativity of updates has been
explored extensively [34, 32], but the typical assumption is that
a domain-specific commutativity oracle is available, such as
for edits to textual data [32, 28]. Our notion of subsumption
is quite similar to Badrinath and Ramamritham's [6]
recoverability property. Unlike subsumption, this property is
defined in terms of observable side-effects rather than state,
but is otherwise identical. Like prior work on commutativity,
they assume that a domain-specific oracle has been provided.
Several efforts have been made to understand domain-specific
reconciliation strategies. Feldman et al.'s Operational
Transforms [21] are analogous to our mutation languages,
but assume that domain-specific operations analogous to our
merge operation are available. Perhaps the closest efforts to
our own have been Preguica et al.'s IceCube [30] and
Edwards et al.'s Bayou [18], each of which exploits a range of
specific algebraic properties of updates to distributed state.
However, both systems must be explicitly adapted to specific
application domains by the construction of domain-specific
property oracles, or by mapping the application's behavior
down to a trivial update language. To the best of our
knowledge, none of these areas have been explored in the context
of a non-trivial state update language.

Update Sequencing. The use of distributed logs and
publish/subscribe to apply a canonical order to updates has also
been explored extensively by the distributed systems and
database communities. Ellis et al. noted the relevance of
sequencing to distributed concurrency control [19]. Eugster
et al. identified the usefulness of sequencing updates to
distributed collection types [20]. Domain-specific applications
of similar ideas can be found in work by Ostrowski and
Birman [29], Weatherspoon et al. [33], and others.

Intent-Based Updates. The use of intent-based (i.e.,
operational) updates appears frequently in the database literature,
especially in the context of distributed databases, where it is
used to reduce communication overhead. Two concrete
examples are Ceri and Widom's Starburst [13], and Chang et
al.'s BigTable El.